Skip to contents




Introduction

Mass Spectrometry (MS) and Raman spectroscopy (Raman) are orthogonal techniques. While MS measures the chemical composition of a given sample, Raman is able to provide structural information of the constituents. This orthogonality is interesting when applying quality evaluation workflows, where both chemical composition and structural integrity are relevant. In this article, we show how MS and Raman can be pre-processed and fused for a combined statistical evaluation to support quality control. As test sets, we use four different bovine serum albumin (BSA) products where we evaluate the quality of the active product ingredient (API).

Pre-processing

The preliminary processing is applied differently for MS and Raman data as the data structure does not allow direct fusion of the data. Therefore, in the following two sub-chapters we describe the pre-processing workflow steps for MS and Raman data.

MS

The four different BSA products were measured in triplicate in full scan MS mode. Giving the following set of .d files.

# Raw data from an Agilent QTOF
basename(ms_files_d)
 [1] "BSA_7AqYf_1.d" "BSA_7AqYf_2.d" "BSA_7AqYf_3.d" "BSA_m595c_1.d"
 [5] "BSA_m595c_2.d" "BSA_m595c_3.d" "BSA_utpLV_1.d" "BSA_utpLV_2.d"
 [9] "BSA_utpLV_3.d" "BSA_wAgwS_1.d" "BSA_wAgwS_2.d" "BSA_wAgwS_3.d"
[13] "H2O.d"        

The first step is conversion to open source mzML format using MSConvert from ProteoWizard CLI, as shown below.

# Add option to centroid MS1 data before conversion
optList <- list(filter = "peakPicking vendor msLevel=1")

# convert_ms_files(ms_files_d, outputFormat = "mzML", optList = optList)

After conversion, the mzML files are created in the same directory. The converted files are listed below.

basename(ms_files)
 [1] "BSA_7AqYf_1.mzML" "BSA_7AqYf_2.mzML" "BSA_7AqYf_3.mzML" "BSA_m595c_1.mzML"
 [5] "BSA_m595c_2.mzML" "BSA_m595c_3.mzML" "BSA_utpLV_1.mzML" "BSA_utpLV_2.mzML"
 [9] "BSA_utpLV_3.mzML" "BSA_wAgwS_1.mzML" "BSA_wAgwS_2.mzML" "BSA_wAgwS_3.mzML"
[13] "H2O.mzML"        

With the mzML files, a MassSpecEngine is created to pre-process the MS data and the analysis replicate names are modified to properly assign the analyses within each replicate measurement.

ms <- MassSpecEngine$new(analyses = ms_files, headers = list(name = "BSA Quality Evaluation Project"))

# Changes the analysis replicate names
ms$analyses$replicates <- sub("_\\d+$", "", ms$analyses$names)

# Overview
ms

MassSpecEngine
name          NA
date          2024-10-27 13:02:14.353752
file          NA

Workflow empty 

StreamFind::MassSpecAnalyses
       analysis replicate  blank   type polarity spectra chromatograms
         <char>    <char> <char> <char>   <char>   <num>         <num>
 1: BSA_7AqYf_1 BSA_7AqYf   <NA>     MS positive     591             7
 2: BSA_7AqYf_2 BSA_7AqYf   <NA>     MS positive     591             7
 3: BSA_7AqYf_3 BSA_7AqYf   <NA>     MS positive     591             7
 4: BSA_m595c_1 BSA_m595c   <NA>     MS positive     592             7
 5: BSA_m595c_2 BSA_m595c   <NA>     MS positive     591             7
 6: BSA_m595c_3 BSA_m595c   <NA>     MS positive     592             7
 7: BSA_utpLV_1 BSA_utpLV   <NA>     MS positive     591             7
 8: BSA_utpLV_2 BSA_utpLV   <NA>     MS positive     591             7
 9: BSA_utpLV_3 BSA_utpLV   <NA>     MS positive     591             7
10: BSA_wAgwS_1 BSA_wAgwS   <NA>     MS positive     592             7
11: BSA_wAgwS_2 BSA_wAgwS   <NA>     MS positive     591             7
12: BSA_wAgwS_3 BSA_wAgwS   <NA>     MS positive     591             7
13:         H2O       H2O   <NA>     MS positive     591             7

Results empty 
# Note that the TIC chromatograms are not yet loaded (see below)
# In plot_spectra_tic, the TIC from spectra headers are used
ms$plot_spectra_tic(colorBy = "replicates")

Total ion chromatograms (TIC) for each analysis replicate.

The data processing workflow is based on applied processing methods within the engine. As pre-processing, the following steps are applied:

  1. Find the retention time elution of BSA based on the total ion chromatogram (TIC) of each analysis;
  2. Find the mass-to-charge (m/z) range of BSA for deconvolution of charges;
  3. Deconvolute charges for estimation of the exact mass, in DA;

The individual processing steps are applied according to ProcessingSettings. Below, the integration of the TIC is shown using the dedicated ProcessingSettings after loading the TIC chromatograms from the analysis files and smoothing using a moving average approach.

ms$load_chromatograms(chromatograms = "TIC")
ms$analyses$has_chromatograms
[1] TRUE
# Smoothing of the TIC chromatogram for improving peak finding
ms$run(MassSpecSettings_SmoothChromatograms_movingaverage(windowSize = 5))
ms$plot_chromatograms(colorBy = "replicates")

Smoothed TICs for each analysis replicate.

# Settings for integration of the TIC
ps_ic <- MassSpecSettings_IntegrateChromatograms_StreamFind(
  merge = TRUE,
  closeByThreshold = 5,
  minPeakHeight = 3E6,
  minPeakDistance = 3,
  minPeakWidth = 20,
  maxPeakWidth = 200,
  minSN = 5
)

show(ps_ic)

 StreamFind::MassSpecSettings_IntegrateChromatograms_StreamFind 
 engine       MassSpec
 method       IntegrateChromatograms
 algorithm    StreamFind
 version      0.2.0
 software     StreamFind
 developer    Ricardo Cunha
 contact      cunha@iuta.de
 link         https://odea-project.github.io/StreamFind
 doi          NA

 parameters: 
  -  merge TRUE 
  -  closeByThreshold 5 
  -  minPeakHeight 3e+06 
  -  minPeakDistance 3 
  -  minPeakWidth 20 
  -  maxPeakWidth 200 
  -  minSN 5 
ms$run(ps_ic)
ms$plot_chromatograms_peaks(colorBy = "replicates")

TIC with integrated peaks.

The average retention time of the main peak was 363.9823581 seconds with a standard deviation of 0.8060852. The following step is to estimate the m/z range of BSA for deconvolution of charges.

# Average retention time of the main peak
rt_av <- mean(data.table::rbindlist(ms$chromatograms$peaks)$rt)
rt_av
[1] 363.9824
ms$plot_spectra_ms1(rt = rt_av, sec = 0.5, colorBy = "replicates",
  presence = 0.1, minIntensity = 5000, interactive = FALSE
)
Spectra of BSA for each analysis replicate at the peak maximum.

Spectra of BSA for each analysis replicate at the peak maximum.

According to the MS1 spectra, the m/z from 1200 to 2000 can be added to the dedicated ProcessingSettings for deconvolution, as shown below. The workflow is an ordered list of ProcessingSettings objects for each workflow step, which is added to the engine. The exact parameters for each method can be optimized individually by calling the specific method directly. Once the settings are added, the workflow can be inspected with the print_workflow() method.

ms_workflow <- StreamFind::Workflow(
  list(
    
    MassSpecSettings_LoadSpectra_StreamFind(
      mzmin = 1200,
      mzmax = 2000,
      rtmin = rt_av - 1,
      rtmax = rt_av + 1,
      levels = 1
    ),
    
    MassSpecSettings_ClusterSpectra_StreamFind(
      val = "mz",
      clustVal = 0.01,
      presence = 0.2
    ),
    
    MassSpecSettings_CalculateSpectraCharges_StreamFind(
      roundVal = 15,
      relLowCut = 0.2,
      absLowCut = 8000
    ),
    
    MassSpecSettings_DeconvoluteSpectra_StreamFind(),
    
    MassSpecSettings_SmoothSpectra_movingaverage(
      windowSize = 15
    ),
    
    MassSpecSettings_BinSpectra_StreamFind(
      binNames = c("rt", "mass"),
      binValues = c(5, 20),
      byUnit = TRUE,
      refBinAnalysis = 1
    ),
    
    MassSpecSettings_AverageSpectra_StreamFind(),
    
    MassSpecSettings_NormalizeSpectra_minmax(),
    
    MassSpecSettings_NormalizeSpectra_blockweight()
  )
)

# Replaces the settings added to integrate the TIC chromatograms
ms$workflow <- ms_workflow

ms$print_workflow()

Workflow
 1: LoadSpectra (StreamFind)
 2: ClusterSpectra (StreamFind)
 3: CalculateSpectraCharges (StreamFind)
 4: DeconvoluteSpectra (StreamFind)
 5: SmoothSpectra (movingaverage)
 6: BinSpectra (StreamFind)
 7: AverageSpectra (StreamFind)
 8: NormalizeSpectra (minmax)
 9: NormalizeSpectra (blockweight)

The run_workflow() method is called to run the data processing.

ms$run_workflow()
plot_spectra(ms$analyses, colorBy = "analyses")

Deconvoluted and pre-processed spectra for each analysis.

Raman

The four different BSA products were measured using an online coupling of size exclusion chromatography (SEC) with capillary-enhanced Raman spectroscopy (CERS) through a liquid-core optical fibre flow cell, as reported by Thissen et al. (DOI: 10.1021/acs.analchem.3c03991). The raw spectra was converted to asc format for each retention time. A subset of the background and the main BSA elution peak was selected for pre-processing. The first pre-processing step is to merge the individual asc files into a main file for each BSA product, as shown below.

# Engine with all asc files for each spectrum and each BSA product
raman <- RamanEngine$new(raman_files)

# Folder names of each BSA product
base_dirs <- basename(dirname(raman_files))

# The folder name is used as replicate name, which is also used as file name for the unified data file
raman$add_replicate_names(base_dirs)

# The spectra are merged into a time series for each BSA product
raman$merge_spectra_time_series()

A new engine is created with the unified file for each BSA product and then plotted as a time series.

raman <- RamanEngine$new(analyses = raman_unified_files)

# Modifying the replicate names to match the MS analyses
raman$analyses$replicates <- unique(ms$analyses$replicates)[1:4]

raman

RamanEngine
name          NA
date          2024-10-27 13:02:38.37278
file          NA

Workflow empty 

StreamFind::RamanAnalyses
          analysis replicate  blank   type spectra
            <char>    <char> <char> <char>   <num>
1: BSA_7AqYf_Raman BSA_7AqYf   <NA>  raman 4093952
2: BSA_m595c_Raman BSA_m595c   <NA>  raman 4093952
3: BSA_utpLV_Raman BSA_utpLV   <NA>  raman 4093952
4: BSA_wAgwS_Raman BSA_wAgwS   <NA>  raman 4093952

Results empty 
raman$plot_chromatograms(yLab = "Abbundance sum for each spectrum", colorBy = "analyses")

Time series chromatogram for each BSA product. Each data point is a Raman spectrum at a specific retention time.

raman$plot_spectra(rt = c(430, 440), colorBy = "analyses")

Raw averaged Raman spectra for each BSA product between 430 and 440 seconds.

As for MS data, the workflow for Raman data is also assembled using an ordered list of processing settings, as shown below.

raman_workflow <- StreamFind::Workflow(
  list(
    RamanSettings_BinSpectra_StreamFind(
      binNames = "rt",
      binValues = 20,
      byUnit = FALSE
    ),
    
    RamanSettings_SubtractSpectraSection_StreamFind(
      sectionVal = "rt",
      sectionWindow = c(0, 200)
    ),
    
    RamanSettings_DeleteSpectraSection_StreamFind(
      shiftmin = -100,
      shiftmax = 315
    ),
    
    RamanSettings_SmoothSpectra_savgol(
      fl = 11,
      forder = 2,
      dorder = 0
    ),
    
    RamanSettings_DeleteSpectraSection_StreamFind(
      shiftmin = 315,
      shiftmax = 330
    ),
    
    RamanSettings_DeleteSpectraSection_StreamFind(
      shiftmin = 2300,
      shiftmax = 2600
    ),
    
    RamanSettings_CorrectSpectraBaseline_baseline_als(
      lambda = 3,
      p = 0.015,
      maxit = 10
    ),
    
    RamanSettings_DeleteSpectraSection_StreamFind(
      rtmin = 0,
      rtmax = 420
    ),
    
    RamanSettings_DeleteSpectraSection_StreamFind(
      rtmin = 430,
      rtmax = 900
    ),
    
    RamanSettings_AverageSpectra_StreamFind(),
    
    RamanSettings_NormalizeSpectra_minmax()
  ) 
)

raman$workflow <- raman_workflow

raman$print_workflow()

Workflow
 1: BinSpectra (StreamFind)
 2: SubtractSpectraSection (StreamFind)
 3: DeleteSpectraSection (StreamFind)
 4: SmoothSpectra (savgol)
 5: DeleteSpectraSection (StreamFind)
 6: DeleteSpectraSection (StreamFind)
 7: CorrectSpectraBaseline (baseline_als)
 8: DeleteSpectraSection (StreamFind)
 9: DeleteSpectraSection (StreamFind)
 10: AverageSpectra (StreamFind)
 11: NormalizeSpectra (minmax)

A particularity of the workflow for Raman data is that the background subtraction (workflow step number 3) is performed from a time region (i.e., between 0 and 3 minutes) before the main BSA peak, which is from approximately from 5 to 8 minutes. The workflow is then applied using the run_workflow() method.

raman$run_workflow()
raman$plot_spectra_baseline()

Baseline correction for each spectra.

raman$plot_spectra()

Processed Raman spectra for each BSA product.

Fusion

The fusion of MS and Raman data is performed by combining the processed data from both engines. The data is then merged into a single data frame for statistical evaluation using a StatisticEngine, as shown in the next chapter.

ms_mat <- get_spectra_matrix(ms$analyses)
raman_mat <- get_spectra_matrix(raman$analyses)
fused_mat <- cbind(ms_mat, raman_mat)
attr(fused_mat, "xaxis.name") = "Keys"
attr(fused_mat, "xaxis.values") = seq_len(ncol(fused_mat))
str(fused_mat)
 num [1:4, 1:945] 0.01453 0.01554 0.00166 0.00643 0.01665 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:4] "BSA_7AqYf" "BSA_m595c" "BSA_utpLV" "BSA_wAgwS"
  ..$ : chr [1:945] "364-65465" "364-65485" "364-65505" "364-65525" ...
 - attr(*, "xaxis.name")= chr "Keys"
 - attr(*, "xaxis.values")= int [1:945] 1 2 3 4 5 6 7 8 9 10 ...

Statistic analysis

A statistic analysis is performed with the StatisticEngine for evaluation of the differences between the BSA products.

PCA

The fused data is used to create a PCA model with the StatisticEngine using processing settings based on the mdatools package.

stat <- StatisticEngine$new(analyses = fused_mat)
# Plots the data from the added fused data.frame/matrix
plot_data(stat$analyses)
stat$run(StatisticSettings_MakeModel_pca_mdatools(ncomp = 2, alpha = 0.05, gamma = 0.02))
summary(stat$model) # summary method is from mdatools

Summary for PCA model (class pca)
Type of limits: ddmoments
Alpha: 0.05
Gamma: 0.02

       Eigenvals Expvar Cumexpvar Nq Nh
Comp 1   141.860  98.50      98.5 19 56
Comp 2     1.143   0.79      99.3  7 56
plot(stat$model)

Overview plot for the PCA model.

# Adds a color key to the loadings plot to evaluate the contribution of each dataset (i.e., MS and Raman)
plot_loadings(stat$analyses, colorKey = rep(c("MS", "Raman"), c(ncol(ms_mat), ncol(raman_mat))))

PCA loadings plot for the fused data.

# The mdatools S3 class model object is obtained via the active binding of the StatisticEngine
plot(stat$model$model, show.labels = TRUE) # plot method from mdatools
PCA summary plot using the original mdatools R package.

PCA summary plot using the original mdatools R package.

MCR purity

Another approach for the statistic analysis is to use the MCR pure model (based on mdatools R package) to evaluate the purity level in the analyses.

# Note that the model in the engine will be overwritten by the MCR model
stat$run(StatisticSettings_MakeModel_mcrpure_mdatools(ncomp = 2, offset = 0.05))
summary(stat$model)

Summary for MCR Purity case (class mcrpure)
       Expvar Cumexpvar Varindex Purity
Comp 1  45.60     45.60      235 24.018
Comp 2  53.68     99.28      379 44.456
plot(stat$model)

Overview plot of the MCR pure model.

# Gets the most pure features for each component
get_model_data(stat$model)$purity
    vars     vals feature result
   <int>    <num>  <char> <char>
1:   235 24.01823 709.766  model
2:   379 44.45558 1092.64  model